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Abstract 


We propose Kernel Hamiltonian Monte Carlo (KMC), a gradient-free adaptive 
MCMC algorithm based on Hamiltonian Monte Carlo (HMC). On target densities 
where classical HMC is not an option due to intractable gradients, KMC adap¬ 
tively learns the target’s gradient structure by fitting an exponential family model 
in a Reproducing Kernel Hilbert Space. Computational costs are reduced by two 
novel efficient approximations to this gradient. While being asymptotically exact, 

KMC mimics HMC in terms of sampling efficiency, and offers substantial mixing 
improvements over state-of-the-art gradient free samplers. We support our claims 
with experimental studies on both toy and real-world applications, including Ap¬ 
proximate Bayesian Computation and exact-approximate MCMC. 

1 Introduction 

Estimating expectations using Markov Chain Monte Carlo (MCMC) is a fundamental approximate 
inference technique in Bayesian statistics. MCMC itself can be computationally demanding, and 
the expected estimation error depends directly on the correlation between successive points in the 
Markov chain. Therefore, efficiency can be achieved by taking large steps with high probability. 

Hamiltonian Monte Carlo H] is an MCMC algorithm that improves efficiency by exploiting gra¬ 
dient information. It simulates particle movement along the contour lines of a dynamical system 
constructed from the target density. Projections of these trajectories cover wide parts of the target’s 
support, and the probability of accepting a move along a trajectory is often close to one. Remark¬ 
ably, this property is mostly invariant to growing dimensionality, and HMC here often is superior to 
random walk methods, which need to decrease their step size at a much faster rate m Sec. 4.4]. 

Unfortunately, for a large class of problems, gradient information is not available. For example, in 
Pseudo-Marginal MCMC (PM-MCMC) Ellll, the posterior does not have an analytic expression, 
but can only be estimated at any given point, e.g. in Bayesian Gaussian Process classification lH. A 
related setting is MCMC for Approximate Bayesian Computation (ABC-MCMC), where the pos¬ 
terior is approximated through repeated simulation from a likelihood model HJISl. In both cases, 
HMC cannot be applied, leaving random walk methods as the only mature alternative. There have 
been efforts to mimic HMC’s behaviour using stochastic gradients from mini-batches in Big Data 
IfTl . or stochastic finite differences in ABC IH. Stochastic gradient based HMC methods, however, 
often suffer from low acceptance rates or additional bias that is hard to quantify Q. 

Random walk methods can be tuned by matching scaling of steps and target. For example. Adaptive 
Metropolis-Hastings (AMH) ifTOl [TTIl is based on learning the global scaling of the target from the 
history of the Markov chain. Yet, for densities with nonlinear support, this approach does not work 
very well. Recently, ifT^ introduced a Kernel Adaptive Metropolis-Hastings (KAMH) algorithm 
whose proposals are locally aligned to the target. By adaptively learning target covariance in a 
Reproducing Kernel Hilbert Space (RKHS), KAMH achieves improved sampling efficiency. 
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In this paper, we extend the idea of using kernel methods to learn efficient proposal distributions m. 
Rather than locally smoothing the target density, however, we estimate its gradients globally. More 
precisely, we fit an infinite dimensional exponential family model in an RKHS via score matching 
iiniiii. This is a non-parametric method of modelling the log unnotmalised target density as an 
RKHS function, and has been shown to approximate a rich class of density functions arbitrarily well. 
More importantly, the method has been empirically observed to be relatively robust to increasing 
dimensionality - in sharp contrast to classical kernel density estimation m Sec. 6.5]. Gaussian 
Processes (GP) were also used in lfl6l as an emulator of the target density in order to speed up 
HMC, however, this requires access to the target in closed form, to provide training points for the 
GP. 

We require our adaptive KMC algorithm to be computationally efficient, as it deals with high¬ 
dimensional MCMC chains of growing length. We develop two novel approximations to the infinite 
dimensional exponential family model. The first approximation, score matching lite, is based on 
computing the solution in terms of a lower dimensional, yet growing, subspace in the RKHS. KMC 
with score matching lite {KMC lite) is geometrically ergodic on the same class of targets as stan¬ 
dard random walks. The second approximation uses a finite dimensional feature space {KMCfinite), 
combined with random Fourier features im. KMC finite is an efficient online estimator that allows 
to use all of the Markov chain history, at the cost of decreased efficiency in unexplored regions. A 
choice between KMC lite and KMC hnite ultimately depends on the ability to initialise the sampler 
within high-density regions of the target; alternatively, the two approaches could be combined. 

Experiments show that KMC inherits the efficiency of HMC, and therefore mixes significantly better 
than state-of-the-art gradient-free adaptive samplers on a number of target densities, including on 
synthetic examples, and when used in PM-MCMC and ABC-MCMC. All code can be found at 

https://github.com/karlnapf/kernel_hmc 

2 Background and Previous Work 

Let the domain of interest A be a compacj^ subset of and denote the unnormalised target den¬ 

sity on X by tt. We are interested in constructing a Markov chain xi X 2 ... such that 
limt_>.oo Xt ~ TT. By running the Markov chain for a long time T, we can consistently approximate 
any expectation w.r.t. tt. Markov chains are constructed using the Metropolis-Hastings algorithm, 
which at the current state Xt draws a point from a proposal mechanism x* ~ Q{-\xt), and sets 
Xt+i ^ X* with probability miffil, [7r(a;*)(5(a;t|a;*)]/[7r(a;t)(5(a:*|a;t)]), and Xt+i ^ Xt otherwise. 
We assume that tt is intractable ffi.e. that we can neither evaluate tt { x ) nojjV log7r(a;) for any x, 
but can only estimate it unbiasedly via if (x). Replacing tt ^ x ) with ti { x ) results in PM-MCMC IllO, 
which asymptotically remains exact {exact-approximate inference). 

(Kernel) Adaptive Metropolis-Hastings In the absence of V log tt, the usual choice of Q is a 
random walk, i.e. Q{-\xt) = Af{-\xt,^t)- A popular choice of the scaling is Ht oc I. When the 
scale of the target density is not uniform across dimensions, or if there are strong correlations, the 
/kMH algorithm Go] El improves mixing by adaptively learning global covariance structure of tt 
from the history of the Markov chain. For cases where the local scaling does not match the global 
covariance of tt, i.e. the support of the target is nonlinear, KAMH ifl^ improves mixing by learning 
the target covariance in a RKHS. KAMH proposals are Gaussian with a covariance that matches the 
local covariance of tt around the current state xt, without requiring access to V log tt. 

Hamiltonian Monte Carlo Hamiltonian Monte Carlo (HMC) uses deterministic, measure¬ 
preserving maps to generate efficient Markov transitions llllll. Starting from the negative log 
target, referred to as the potential energy U{q) = — log TT{q), we introduce an auxiliary momen¬ 
tum variable p ~ exp(—Ar(p)) with p G X. The joint distribution of {p, q) is then proportional 
to exp {—H{p,q)), where H{p,q) := K{p) -\- U{q) is called the Hamiltonian. H{p,q) defines a 
Hamiltonian flow, parametrised by a trajectory length t G M., which is a map : (p, q) i—)■ {p* ,q*) 
for which H{p* ,q*) = H{p, q). This allows constructing 7r-invariant Markov chains: for a chain at 
state q = Xt, repeatedly (i) re-sample p' ^ exp(—iT(-)), and then (ii) apply the Hamiltonian flow 


*The compactness restriction is imposed to satisfy the assumptions in fo). 

^TT is analytically intractable, as opposed to computationally expensive in the Big Data context. 
^Throughout the paper V denotes the gradient operator w.r.t. to x. 
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for time t, giving {p*,q*) = (j)^{p', q). The flow can be generated by the Hamiltonian operator 

dK d dU d 
dp dq dq dp 

In practice, Q is usually unavailable and we need to resort to approximations. Here, we limit our¬ 
selves to the leap-frog integrator; see III for details. To correct for discretisation error, a Metropolis 
acceptance procedure can be applied: starting from {p',q), the end-point of the approximate tra¬ 
jectory is accepted with probability min [1, exp {—H{p*,q*) -f H{p', q))]. HMC is often able to 
propose distant, uncorrelated moves with a high acceptance probability. 

Intractable densities In many cases the gradient of log Tr{q) = —U{q) cannot be written in closed 
form, leaving random-walk based methods as the state-of-the-art iniiii. We aim to overcome 
random-walk behaviour, so as to obtain significantly more efficient sampling m. 

3 Kernel Induced Hamiltonian Dynamics 

KMC replaces the potential energy in ([T]i by a kernel induced surrogate computed from the history of 
the Markov chain. This surrogate does not require gradients of the log-target density. The surrogate 
induces a kernel Hamiltonian flow, which can be numerically simulated using standard leap-frog 
integration. As with the discretisation error in HMC, any deviation of the kernel induced flow 
from the true flow is corrected via a Metropolis acceptance procedure. This here also contains 
the estimation noise from tt and re-uses previous values of if, c.f. 0 Table 1]. Consequently, 
the stationary distribution of the chain remains correct, given that we take care when adapting the 
surrogate. 

Infinite Dimensional Exponential Families in a RKHS We construct a kernel induced potential 
energy surrogate whose gradients approximate the gradients of the true potential energy U in Q. 
without accessing tt or Vtt directly, but only using the history of the Markov chain. To that end, we 
model the (unnormalised) target density tv { x ) with an infinite dimensional exponential family model 
Ga of the form 

const X 7r(a:) « exp ((/, k{x, ■))h - A{f)), (2) 

which in particular implies V/ « — Vt/ = VlogTr. Here H is a RKHS of real valued functions 
on X. The RKHS has a uniquely associated symmetric, positive definite kernel k : X x X ^ W, 
which satisfies f{x) = {f,k{x,-))'M. for any f G H lfT9l . The canonical feature map k{-,x) G H 
here takes the role of the sufficient statistics while f G H. am the natural parameters, and 
A{f) := log exp((/, k{x, ■))'H)dx is the cumulant generating function. Eq. ([^ defines broad 
class of densities: when universal kernels are used, the family is dense in the space of continuous 
densities on compact domains, with respect to e.g. Total Variation and KL ifTSl Section 3]. It is pos¬ 
sible to consistently fit an unnormalised version of (|^ by directly minimising the expected gradient 
mismatch between the model (|^ and the true target density tt (observed through the Markov chain 
history). This is achieved by generalising the score matching approach na to infinite dimensional 
parameter spaces. The technique avoids the problem of dealing with the intractable A{f), and re¬ 
duces the problem to solving a linear system. More importantly, the approach is observed to be 
relatively robust to increasing dimensions. We return to estimation in Section]^ where we develop 
two efficient approximations. For now, assume access to an / € "H such that V/(x) « V log 7r(x). 


Kernel Induced Hamiltonian Flow We define a kernel induced Hamiltonian operator by replac¬ 
ing U in the potential energy part ^ in Q by our kernel surrogate [/^ = /. It is clear that, 
depending on Uk, the resulting kernel induced Hamiltonian flow differs from the original one. That 
said, any bias on the resulting Markov chain, in addition to discretisation error from the leap-frog 
integrator, is naturally corrected for in the Pseudo-Marginal Metropolis step. We accept an end-point 
(j)^^ {p', q) of a trajectory starting at (p', q) along the kernel induced flow with probability 


1, exp ( —H 


(- 


Hk! 


{p',q)] XH{p',q 


(3) 


where H 


(^4'^'° {p',q)^ corresponds to the true Hamiltonian at (j/,q). Here, in the Pseudo- 

Marginal context, we replace both terms in the ratio in ([^ by unbiased estimates, i.e., we replace 


3 




HMC Acceptance prob. KMC Acceptance prob. 



Figure 1: Hamiltonian trajectories on a 2-dimensional standard Gaussian. End points of such trajec¬ 
tories (red stars to blue stars) form the proposal of HMC-like algorithms. Left: Plain Hamiltonian 
trajectories oscillate on a stable orbit, and acceptance probability is close to one. Right: Kernel 
induced trajectories and acceptance probabilities on an estimated energy function. 


7r(g) within H with an unbiased estimator 7r(g). Note that this also involves ‘recycling’ the es¬ 
timates of H from previous iterations to ensure anyymptotic correctness, c.f. in Table 1]. Any 
deviations of the kernel induced flow from the true flow result in a decreased acceptance probability 
0. We therefore need to control the approximation quality of the kernel induced potential energy 
to maintain high acceptance probability in practice. See Figure[2for an illustrative example. 

4 Two Efficient Estimators for Exponential Families in RKHS 

We now address estimating the infinite dimensional exponential family model (|^ from data. The 
original estimator in m has a large computational cost. This is problematic in the adaptive MCMC 
context, where the model has to be updated on a regular basis. We propose two efficient approxima¬ 
tions, each with its strengths and weaknesses. Both are based on score matching. 

4.1 Score Matching 

Following El , we model an unnormalised log probability density log tt{x) with a parametric model 

log Tf^ (x; /) := log if (a:; /) - log Z(/), (4) 

where / is a collection of parameters of yet unspecified dimension (c.f natural parameters of (|^), 
and Z{f) is an unknown normalising constant. We aim to And / from a set of n sample^ 22 := 
{xi}7=i ~ that 7r{a:) « 7f(a:; /) x const. From lfT4l Eq. 2], the criterion being optimised is 

the expected squared distance between gradients of the log density, so-called score functions, 

J{f) = ^ J^T^{x)\\W\og^x;f)-W\ogTT{x)\\ldx, 

where we note that the normalising constants vanish from taking the gradient V. As shown in lfT4l 
Theorem 1], it is possible to compute an empirical version without accessing 7r(x) or V log7r(a;) 
other than through observed samples, 

\ogn{x-J) 1 / d\og^{x-J) \ ^1 
dxj 2 \ dxi ) 

Our approximations of the original model (|^ are based on minimising (|^ using approximate scores. 

4.2 Infinite Dimensional Exponential Families Lite 

The original estimator of / in (|^ takes a dual form in a RKHS sub-space spanned by nd -f 1 kernel 
derivatives, lfT3l Thm. 4]. The update of the proposal at the iteration t of MCMC requires inversion 
of a (td -f 1) X {td 4-1) matrix. This is clearly prohibitive if we are to run even a moderate number 
of iterations of a Markov chain. Following ifT^ . we take a simple approach to avoid prohibitive 
computational costs in t: we form a proposal using a random sub-sample of fixed size n from the 
Markov chain history, z := {zi}f^i C {xi}l^i. In order to avoid excessive computation when d is 
large, we replace the full dual solution with a solution in terms of span {{k{zi, - jliLi), which covers 
the support of the true density by construction, and grows with increasing n. That is, we assume that 
the model Q takes the ‘light’ form 

‘'We assume a fixed sample set here but will use both the full chain history {xi}i^i or a sub-sample later. 




xGV 1^1 


4 









































































































f{x)='^aik{z^,x), ( 6 ) 

2 — 1 

where a S K" are real valued parameters that are obtained by minimising the empirical score match¬ 
ing objective This representation is of a form similar to EOl Section 4.1], the main differences 
being that the basis functions are chosen randomly, the basis set grows with n, and we will require 
an additional regularising term. The estimator is summarised in the following proposition, which is 
proved in Appendix [A| 

Proposition 1. Given a set of samples z = {zi}2^i and assuming f{x) = X)r=i x) for the 

Gaussian kernel of the form k{x, y) = exp (— o.nd A > 0, the unique minimiser of 
the X\\ fW^-regularised empirical score matching objective Q is given by 

ax =-^{C + XI)-\ (7) 

where b G ffi” and C G are given by 

b = ^ + Ds,Kl - 2D^,Kxi) - Kl\ andC = J2 [KD^, - D^,K], 

with entry-wise products si := x^ © xi and := diag(x). 


The estimator costs 0{n^ + dnf) computation (for computing C, b, and for inverting C) and 0{nf) 
storage, for a fixed random chain history sub-sample size n. This can be further reduced via low-rank 
approximations to the kernel matrix and conjugate gradient methods, which are derived in Appendix 

s 

Gradients of the model are given as V/(x) = aiX/k{x, Xi), i.e. they simply require to evalu¬ 

ate gradients of the kernel function. Evaluation and storage of V/( ) both cost 0{dn). 


4.3 Exponential Families in Finite Feature Spaces 

Instead of fitting an infinite-dimensional model on a subset of the available data, the second estimator 
is based on fitting a finite dimensional approximation using all available data {xi}\=i, in primal 
form. As we will see, updating the estimator when a new data point arrives can be done online. 


Define an m-dimensional approximate feature space Tfm = K™. and denote by fx € 'Hm the 
embedding of a point x € A” = into Tfm = Assume that the embedding approximates the 
kernel function as a finite rank expansion k{x,y) « <fl4’y The log unnormalised density of the 
infinite model (|^ can be approximated by assuming the model in Q takes the form 

/(x) = (6>, = d^fx (8) 

To fit 0 € K"*, we again minimise the score matching objective Q, as proved in Appendix [B] 
Proposition 2. Given a set of samples x = and assuming f{x) = 9^ (f>x far a finite 

dimensional feature embedding x i-G- fix & K"*. ond A > 0, the unique minimiser of the A||0||2- 
regularised empirical score matching objective 0 is given by 

ex-.= {C + XT)-\ (9) 


where 


z a 

i=i 


C := 




2=1 €=1 


An example feature embedding based on random Fourier features iniizii and a standard Gaussian 
kernel is <j>x = [cosjojf x -|- ui), ..., cos^uif^x -|- Um)] , with uji ~ Af{uj) and m ~ Uniform[0,27r]. 

The estimator has a one-off cost of 0{tdmf -f m^) computation and 0{mf) storage. Given that we 
have computed a solution based on the Markov chain history {xi}l^i, however, it is straightforward 
to update C, b, and the solution 9\ online, after a new point Xt+i arrives. This is achieved by 
storing running averages and performing low-rank updates of matrix inversions, and costs 0{dmf) 
computation and Oirvf) storage, independent of t. Further details are given in Appendix]^ 

Gradients of the model are XJf{x) = 9 , i.e., they require the evaluation of the gradient of 

the feature space embedding, costing 0{md) computation and and 0{m) storage. 
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Algorithm 1 Kernel Hamiltonian Monte Carlo - Pseudo-code 


Input'. Target (possibly noisy estimator) tt, adaptation schedule at, HMC parameters. 
Size of basis m or sub-sample size n. 

At iteration f + 1, current state xt, history {xi}l^i, perform (1-4) with probability at 

KMC life: KMC finite: 


1 . 

2 . 

3. 

4. 

5. 

6 . 


Update sub-sample z C {xi}l^i 
Re-compute C, b from Prop. 
Solve Q.\ = —f (C -1- \I)~^b 
V/(a:) ^ ELi aiVfe(a:, zt) 


1. Update to C, b from Prop. 

2. Perform rank-d update to 

3. Update §\ = {C + XI)~^b 

4. Vfix) ^ 


Propose (p'j X*) with kernel induced Hamiltonian flow, using 

Perform Metropolis step using tt: accept Xt+i x* w.p. 0 and reject Xt+i Xt otherwise 
If TT is noisy and x* was accepted, store above 7f(a;*) for evaluating 0 in the next iteration 


5 Kernel Hamiltonian Monte Carlo 

Constructing a kernel induced Hamiltonian flow as in Sectionj^from the gradients of the infinite di¬ 
mensional exponential family model Q, and approximate estimators (0,0, we arrive at a gradient 
free, adaptive MCMC algorithm; Kernel Hamiltonian Monte Carlo (Algorithm^. 

Computational Efficiency, Geometric Ergodicity, and Burn-in KMC finite using 0 allows for 
online updates using the full Markov chain history, and therefore is a more elegant solution than 
KMC lite, which has greater computational cost and requires sub-sampling the chain history. Due 
to the parametric nature of KMC finite, however, the tails of the estimator are not guaranteed to 
decay. For example, the random Fourier feature embedding described below Propositionj^contains 
periodic cosine functions, and therefore oscillates in the tails of 0, resulting in a reduced acceptance 
probability. As we will demonstrate in the experiments, this problem does not appear when KMC 
finite is initialised in high-density regions, nor after bum-in. In situations where information about 
the target density support is unknown, and during burn-in, we suggest to use the lite estimator 0, 
whose gradients decay outside of the training data. As a result, KMC lite is guaranteed to fall back 
to a Random Walk Metropolis in unexplored regions, inheriting its convergence properties, and 
smoothly transitions to HMC-like proposals as the MCMC chain grows. A proof of the proposition 
below can be found in Appendix [C| 

Proposition 3. Assume d = 1, 7r(a;) has log-concave tails, the regularity conditions of 4221 Thm 
2.2] (implying n-irreducibility and smallness of compact sets), that MCMC adaptation stops after 
a fixed time, and a fixed number L of e-leapfrog steps, /f limsup|| 2 ,|| 2 _j,oo ||V/(a :)||2 = 0, and 
3M : \/x : ||V/(a :)||2 < M, then KMC lite is geometrically ergodic from n-almost any starting 
point. 

Vanishing adaptation MCMC algorithms that use the history of the Markov chain for construct¬ 
ing proposals might not be asymptotically correct. We follow lIT^ Sec. 4.2] and the idea of ‘vanish¬ 
ing adaptation’ cn, to avoid such biases. Let be a schedule of decaying probabilities such 

that limt_,,oo = 0 and update the density gradient estimate according to this 

schedule in Algorithm[T] Intuitively, adaptation becomes less likely as the MCMC chain progresses, 
but never fully stops, while sharing asymptotic convergence with adaptation that stops at a fixed 
point ll^ Theorem 1]. Note that Proposition[^is a stronger statement about the convergence rate. 

Free Parameters KMC has two free parameters: the Gaussian kernel bandwidth a, and the regu- 
larisation parameter A. As KMC’s performance depends on the quality of the approximate infinite 
dimensional exponential family model in 0 or 0, a principled approach is to use the score match¬ 
ing objective function in 0 to choose a, A pairs via cross-validation (using e.g. ‘hot-started’ black¬ 
box optimisation). Earlier adaptive kernel-based MCMC methods ifT^ did not address parameter 
choice. 

6 Experiments 

We start by quantifying performance of KMC finite on synthetic targets. We emphasise that these 
results can be reproduced with the lite version. 
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Figure 2: Hypothetical acceptance probability of KMC finite on a challening target in growing 
dimensions. Left: As a function of n = m (x-axis) and d (y-axis). Middle/right: Slices through 
left plot with error bars for fixed n = m and as a function of d (left), and for fixed d as a function of 
n = m (right). 




Minimum ESS 



0 500 1000 1500 2000 


Figure 3; Results for the 8-dimensional synthetic Banana. As the amout of observed data increases, 
KMC performance approaches HMC - outperforming KAMH and RW. 80% error bars over 30 mns. 


KMC Finite; Stability of Trajectories in High Dimensions In order to quantify efficiency in 
growing dimensions, we study hypothetical acceptance rates along trajectories on the kernel induced 
Hamiltonian flow (no MCMC yet) on a challenging Gaussian target; We sample the diagonal entries 
of the covariance matrix from a Gamma (1,1) distribution and rotate with a uniformly sampled 
random orthogonal matrix. The resulting target is challenging to estimate due to its ‘non-singular 
smoothness’, i.e., substantially differing length-scales across its principal components. As a single 
Gaussian kernel is not able to effeciently represent such scaling families, we use a rational quadratic 
kernel for the gradient estimation, whose random features are straightforward to compute. Figure 
l^shows the average acceptance over 100 independent trials as a function of the number of (ground 
truth) samples and basis functions, which are set to be equal n = m, and of dimension d. In low to 
moderate dimensions, gradients of the finite estimator lead to acceptance rates comparable to plain 
FIMC. On targets with more ‘regular’ smoothness, the estimator performs well in up to d « 100, 
with less variance. See Appendix |D . 1 1 for details. 

KMC Finite: HMC-like Mixing on a Synthetic Example We next show that KMC’s perfor¬ 
mance approaches that of HMC as it sees more data. We compare KMC, HMC, an isotropic random 
walk (RW), and KAMH on the 8-dimensional nonlinear banana-shaped target; see Appendix |D.2| 
We here only quantify mixing after a sufficient burn-in (bum-in speed is included in next example). 
We quantify performance on estimating the target’s mean, which is exactly 0. We tuned the scaling 
of KAMH and RW to achieve 23% acceptance. We set HMC parameters to achieve 80% acceptance 
and then used the same parameters for KMC. We ran all samplers for 2000H-200 iterations from a 
random start point, discarded the burn-in and computed acceptance rates, the norm of the empirical 
mean ||E[a;]||, and the minimum effective sample size (ESS) across dimensions. For KAMH and 
KMC, we repeated the experiment for an increasing number of bum-in samples and basis functions 
m = n. Figure shows the results as a function of m = n. KMC clearly outperforms RW and 
KAMH, and eventually achieves performance close to HMC asn = m grows. 

KMC Lite: Pseudo-Marginal MCMC for GP Classification on Real World Data We next 
apply KMC to sample from the marginal posterior over hyper-parameters of a Gaussian Process 
Classification (GPC) model on the UCI Glass dataset ll24ll . Classical HMC cannot be used for this 
problem, due to the intractability of the marginal data likelihood. Our experimental protocol mostly 
follows lfT2l Section 5.1], see Appendix |D. 3 1 but uses only 6000 MCMC iterations without discard¬ 
ing a bum-in, i.e., we study how fast KMC initially explores the target. We compare convergence in 
terms of all mixed moments of order up to 3 to a set of benchmark samples (MMD ||25]| . lower is bet¬ 
ter). KMC randomly uses between 1 and 10 leapfrog steps of a size chosen uniformly in [0.01, 0.1], 
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Figure 4; Left: Results for 9-dimensional marginal posterior over length scales of a GPC model 
applied to the UCI Glass dataset. The plots shows convergence (no burn-in discarded) of all mixed 
moments up to order 3 (lower MMD is better). Middle/right; ABC-MCMC auto-correlation and 
marginal Q\ posterior for a 10-dimensional skew normal likelihood. While KMC mixes as well as 
HABC, it does not suffer from any bias (overlaps with RW, while HABC is significantly different) 
and requires fewer simulations per proposal. 


a standard Gaussian momentum, and a kernel tuned by cross-validation, see Appendix |D. 3 1 We did 
not extensively tune the HMC parameters of KMC as the described settings were sufficient. Both 
KMC and KAMH used 1000 samples from the chain history. Figure|^(left) shows that KMC’s burn- 
in contains a short ‘exploration phase’ where produced estimates are bad, due to it falling back to a 
random walk in unexplored regions, c.f. Proposition]^ From around 500 iterations, however, KMC 
clearly outperforms both RW and the earlier state-of-the-art KAMFI. These results are backed by the 
minimum ESS (not plotted), which is around 415 for KMC and is around 35 and 25 for KAMH and 
RW, respectively. Note that all samplers effectively stop improving from 3000 iterations - indicating 
a burn-in bias. All samplers took Ih time, with most time spent estimating the marginal likelihood. 
KMC Lite: Reduced Simulations and no Additional Bias in ABC We now apply KMC in the 
context of Approximate Bayesian Computation (ABC), which often is employed when the data like¬ 
lihood is intractable but can be obtained by simulation, see e.g. 0. ABC-MCMC Ha targets an 
approximate posterior by constructing an unbiased Monte Carlo estimator of the approximate like¬ 
lihood. As each such evaluation requires expensive simulations from the likelihood, the goal of all 
ABC methods is to reduce the number of such simulations. Accordingly, Hamiltonian ABC was 
recently proposed a, combining the synthetic likelihood approach ^2S\ with gradients based on 
stochastic finite differences. We remark that this requires to simulate from the likelihood in every 
leapfrog step, and that the additional bias from the Gaussian likelihood approximation can be prob¬ 
lematic. In contrast, KMC does not require simulations to construct a proposal, but rather ‘invests’ 
simulations into an accept/reject step that ensures convergence to the original ABC target. Fig¬ 
ure ^(right) compares performance of RW, HABC (sticky random numbers and SPAS, 18] Sec. 4.3, 
4.4]), and KMC on a 10-dimensional skew-normal distribution p{y\0) = 2JV {9,1) T* ((a, y)) with 
0 = a = 1 • 10. KMC mixes as well as HABC, but HABC suffers from a severe bias. KMC also 
reduces the number of simulations per proposal by a factor 2L = 100. See Appendix |D.4| for details. 

7 Discussion 

We have introduced KMC, a kernel-based gradient free adaptive MCMC algorithm that mimics 
HMC’s behaviour by estimating target gradients in an RKHS. In experiments, KMC outperforms 
random walk based sampling methods in up to d = 50 dimensions, including the recent kernel- 
based KAMH Hal. KMC is particularly useful when gradients of the target density are unavailable, 
as in PM-MCMC or ABC-MCMC, where classical HMC cannot be used. We have proposed two 
efficient empirical estimators for the target gradients, each with different strengths and weaknesses, 
and have given experimental evidence for the robustness of both. 

Future work includes establishing theoretical consistency and uniform convergence rates for the 
empirical estimators, for example via using recent analysis of random Fourier Features with tight 
bounds ED, and a thorough experimental study in the ABC-MCMC context where we see a lot of 
potential for KMC. It might also be possible to use KMC as a precomputing strategy to speed up 
classical HMC as in ED- For code, see https://github.com/karInapf/kernel_hmc 
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Appendix 


The Appendix contains proofs for Propositions and as well as additional computational details 
for both KMC lite in Section|^and KMC finite in Section]^ Sectionj^covers the proof of geometric 
ergodicity of KMC lite from Proposition]^ Sectionj^describes further experimental details. 

A Lite Estimator 

Proof of Propositionj^ 

The proof below extends the model in ll20l Section 4.1]. We assume that the model log-density Q 
takes the form in Proposition [T] then directly implement score functions (|^, from which we derive 
an empirical score matching objective as a system of linear equations. 

Proof. As assumed the log unnormalised density takes the form 


/(^) = y^^aik{x^,x) 


where k x 


' is the Gaussian kernel in the form 


k{xi,x) = exp - *11^^ = exp '^{xu - 


The score functions for Q are then given by 

ipe{x;a) := 

and 


(91og^(x;/) 2-^ f ||xi-x||2 

= - > ai{x^e - xf) exp- 

(7 ^^ V 


dxe 


diiii{x]a) := 


log^(x;/) 

d'^xi 


2^ ( ||x.-xf^ , , ^2 M 

= --2^aiexpl-- Xf) expl-- 

2=1 ^ / \ / 1—-^ \ 


Xi - x\\ 


a 


= -^a,exp(- 


i=l 


Xi - X 


-1 -f -{xa - xeY 
a 


Substituting this into (|^ yields 

^ n d 


di^lje{xi;a) + 


2=1 €=1 
. d n n 


= —- 


2=1 j = l 


= 1 : 1 : 


02 \ 


1 “t“ (.Xi£ Xj^j 


na 


£=1 i=l 


( 

{x^t - xu) exp f - 
i=i 



1- 

(N 

H 

1 

1 

a J 


n 2 


We now rewrite J(a) in matrix form. The expression for the term J{a) being optimised is the sum 
of two terms. 


First Term: 


i=i i-i j-i 


( \\x^ 

r 

\ 

- )[ 


1 “h (^Xi£ Xji) 
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We only need to compute 

f: o. exp (- ){x„-x„f 

i=l j=l ' ^ / 

=EE ai exp —^y_'\ ^^2^ ^2^ _ 2xiiXjt) . 

i=l j=l ' ^ / 


Define 


:= [ Xxi 




The final term may be computed with the right ordering of operations, 


—2{a © Xi)^Kxi, 


where a © xr is the entry-wise product. The remaining terms are sums with constant row or column 
terms. Define si := XiQ Xi with components sn = x\(^. Then 


n n 

EE aikijSji = K Si. 

i=ij=i 


Likewise 

n n 

EE aixjikij = (a © Si)^Kl. 

i=i j=i 


Second Term; Considering only the ^-th dimension, this is 

n n 

E E aj(xji - xu) exp 

i=i \_j=i 

In matrix notation, the inner sum is a column vector, 

K{a © Xi) — (Ka) © X£. 

We take the entry-wise square and sum the resulting vector. Denote by := diag(x), then the 
following two relations hold 



K{aQ x) = KD^a, 
(Ka) Q X = DxKa. 

This means that J(a) as defined previously. 


J{a) = — 

nrr < ^ 


^=1 

d 


— a^Ksi -I- (a © se)^Kl — 2(a © xi)^Kxi\ — K1 


H-^ y]] [(ct © Xi)^ K — xj Q {a^K)\ [K{a © X() — {Ka) © Xi] 


£=1 


can be rewritten as 


J(a) = — 
ncr 


— {Ksi -I- DsiKl — 2DxiKxi) — K1 




na 

2 


\£=1 

2 


= —a b -\ - -a Ca, 

na na^ 
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where 



d 

i=i 

Assuming C is invertible, this is minimised by 



b. 


□ 


As in na, we add a term A||/|||^ for A S K’*', in order to control the norm of the natural parameters 
in the RKHS ||/||^- This results in the regularised and numerically more stable solution a\ := 
(C +A/)-i&. 

Reduced Computational Costs via Low-rank Approximations and Conjugate Gradient 

Solving the linear system in Q requires 0{n^) computation and 0{n?) storage for a fixed random 
sub-sample of the chain history z. In order to allow for large n, and to exploit potential manifold 
structure in the RKHS, we apply a low-rank approximation to the kernel matrix via incomplete 
Cholesky ll28l Alg. 5.12], that is a standard way to achieve linear computational costs for kernel 
methods. We rewrite the kernel matrix 

K Ki LL^, 

where L G is obtained via dual partial Gram-Schmidt orthonormalisation and costs both 

0(ni) computation and storage. Usually i n, and £ can be chosen via an accuracy cut-off 
parameter on the kernel spectrum in the same fashion as for other low-rank approximations, such as 
pcaQ Given such a representation of K, we can rewrite any matrix-vector product as 

Kb^ {LL^)b = L{L^b), 

where each left multiplication of L costs 0{nt) and we never need to store LL^. This idea can be 
used to achieve costs of 0{ni) when computing b, and left-multiplying C. Combining the technique 
with conjugate gradient (CG) allows to solve 0 with a maximum of n such matrix-vector products, 
yielding a total computational cost of In practice, we can monitor residuals and stop CG 

after a fixed number of iterations t n, where t depends on the decay of the spectrum of K. 
We arrive at a total cost of 0{nlT) computation and 0(n£) storage. CG also has the advantage of 
allowing for ’hot starts’, i.e. initialising the linear solver at a previous solution. Further details can 
be found in our implementation. 


B Finite Feature Space Estimator 

Proof of Proposition!^ 

We assume the model log-density Q takes the primal form in a finite dimensional feature space as 
in Proposition then again directly implement score functions in 0 and minimise it via a linear 
solve. 


Proof. As assumed the log unnormalised density takes the form 

fix) = {0,fx)'Hrr. = 


^In this paper, we solely use the Gaussian kernel, whose spectrum decays exponentially fast. 
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where a; S is embedded into a finite dimensional feature space T-Lm = 
functions in Q then can be written as the simple linear form 


:= 


aiogif(x;6») -Tit? 


dxi 


= 0'4>l. and := 


as X (j>x- The score 
d'^ log7f(a:; 9) 


aT 7^ 


dxj 


( 10 ) 


where we defined the m-dimensional feature vector derivatives 


Plugging those into the empirical score matching objective in Q, we arrive at 

n d r 


■■= and := -^c, 


2=1 e^i 

^ n d 

= ^EE 


di^i{xi-9) + 


i=l £=1 




= -e^ce - e^b 

2 


( 11 ) 


where 




( 12 ) 


6:=—EE<^'. and C ^ ^ 

i=l r=i i=l i=l 

Assuming that C is invertible (trivial for n > m), the objective is uniquely minimised by differenti¬ 
ating ([TT) wrt. 9, setting to zero, and solving for 9. This gives 

9-=0-^9. (13) 

□ 

Again, similar to ifTSl . we add a term A/2||6*|j| for A G K+ to ( [TT] l, in order to control the norm of 
the natural parameters 9 G This results in the regularised and numerically more stable solution 

9y, := {C + \l)-^b. 

Next, we give an example for the approximate feature space T-Lm- Note that the above approach can 
be combined with any set of finite dimensional approximate feature mappings (p^. 

Example: Random Fourier Features for the Gaussian Kernel 

We now combine the finite dimensional approximate infinite dimensional exponential family model 
with the “random kitchen sink” ini. Assume a translation invariant kernel k{x,y) = k{x — y). 
Bochner’s theorem gives the representation 

k{x, y) = k{x-y)= / exp {iu:^{x - y)) dr(a;), 

where r(a;) is the Fourier transform of the kernel. An approximate feature mapping for such kernels 
can be obtained via dropping imaginary terms and approximating the integral with Monte Carlo 
integration. This gives 

4>x = \ — [cos(a;i'^ x + ui),..., cos(a;^a: + Um)] , 

\m 

with fixed random basis vector realisations that depend on the kernel via r(a;), 

Wi ~ r(a;), 

and fixed random offset realisations 

Ui ~ Unif orin[0, 27r], 

for i = 1... m. It is easy to see that this approximation is consistent for m —> oo, i.e. 

£^.,.6 [(plcpy] = k{x,y). 
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See ifTTll for details and a uniform convergence bound and ETIl for a more detailed analysis with 
tighter bounds. Note that it is possible to achieve logarithmic computational costs in d exploiting 
properties of Hadamard matrices ||2^ . 

The feature map derivatives ([TOli are given by 


2 d 




= - \ — [sin(wf^ + ui)uju, ... ,sin(a;^^ + Um)uJme] 


= -\l — [sin(wf^ + ui ),... ,sin(a;^^ + Um)] © [ww, ■ ■ ■ ,Wmr], 


where ujji is the f-th component of ujj, and 


2 d 


^ [sin(wf^ + ui), ... ,sin(w^^ + Um)] © [ww,.. ■,ujm£] 


m d^i 
2 

m 


= ~\l~ [cos(wf^ +tilcos(a;^^ + 'U™)] © ..., 

\m 

4'^ © ■ I ^mil I 

where © is the element-wise product. Consequently the gradient is given by 


= 




d 

L^J 


])dxm 


As an example, the translation invariant Gaussian kernel and its Fourier transform are 
fc(a;, 2 /) = exp (-cr"^||a; - 2 /II 2 ) and r(a;) = Af 0, 


Constant Cost Updates 

A convenient property of the finite feature space approximation is that its primal representation of 
the solution allows to update ( |T2l i in an online fashion. When combined with MCMC, each new 
point Xt+i of the Markov chain history only adds a term of the form — ^xt+i ^ 

^ ]^mx7Ti jQ jjjg proving averages of b and C respectively. Consequently, at 
iteration t, rather than fully re-computing ( [T3] l at the cost of 0{tdm? + w?) for every new point, we 
can use rank-d updates to construct the minimiser of GD from the solution of the previous iteration. 
Assume we have computed the sum of all moving average terms, 

from feature vectors derivatives S K™ of some set of points and subsequently receive 

receive a new point Xt+i- We can then write the inverse of the new sum as 

This is the inverse of the rank-d perturbed previous matrix C*. We can therefore construct this 
inverse using d successive applications of the Sherman-Morrison-Woodbury formula for rank-one 
updates, each using 0{m?) computation. Since Ct is positive definita^ we can represent its inverse 
as a numerically much more stable Cholesky factorisation Ct = LtLl~4li is also possible to perform 

is the empirical covariance of the feature derivatives ■ 
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cheap rank-d updates of such Cholesky factor^ Denote by bt the sum of the moving average b. We 
solve as 

9 = C-^b = Qct) 

using cheap triangular back-substitution from Lt, and never storing or explicitly. 

Using such updates, the computational costs for updating the approximate infinite dimensional ex¬ 
ponential family model in every iteration of the Markov chain are 0{dm'^), which constant in t. We 
can therefore use all points in the history for constructing a proposal. See our implementation for 
further details. 

Algorithmic Description: 

1. Update sums 

d 1 ^ 

bt+i = bt-'^ and Ct+i = + 9 X! (<^ 1+1 )^' 

e=i ^ t=\ 

2. Perform rank-d update to obtain updated Cholesky factorisation = Ct+i- 

3. Update approximate infinite dimensional exponential family parameters 

9 = 


C Ergodicity of KMC life 


Notation Denote by Q;(a;t, x*{p')) the probability of accepting a (p', a;*) proposal at state Xt. Let 
a Ab = min(a, 6). Define := Le^V log7r(a:^°))/2 -f ~ *)V log7r(a:^*'^) and 

d(x^^^) := e(Vf(x^^^) + Vf(x^-'^^^))/2 + where x^^^^ is the i-th point of the 

leapfrog integration from x = 

Proof of Proposition]^ 

Proof. We assumed 7r(a:) is log-concave in the tails, meaning 3xu > 0 s.t. for x* > Xt > xjj, 
we have tt{ x*)/ tt( xt) < Il 2 -||a:t|| 2 ) fQj- xt > x* > xjj, we have TT{x*)/TT{xt) > 


o-ai(\\x II 2 


-ILtlb)^ and a similar condition holds in the negative tail. Furthermore, we assumed 
fixed HMC parameters: L leapfrog steps of size e, and wlog the identity mass matrix /. Following 
1 , it is sufficient to show 


lim sup 

||a;t||2->oo . 




a{xt,x*{p'))fj.{dp') < 0 , 


for some s > 0, where pf) is a standard Gaussian measure. Denoting the integral we split it 
into 

s s 

T~^t I T^t I TOO 

for some 5 G (0,1). We show that the first and third terms decay to zero whilst the second remains 
strictly negative as Xt —?■ 00 (a similar argument holds as Xt —>■ —00). We detail the case V/(a;) f 0 

x’’ 

as X —> 00 here, the other is analogous. Taking / * j, we can choose an Xt large enough that 

Xt — C — Lext > Xu, —71 < c{xt — xf) <0 and —72 < d{xt — xf) <0. So for p' G (0, xf) we 
have 

Lep' > X* — Xt > Lep' — 71 ^-ai{-~fi+Lep ) > -^t) > 7r(x*)/7r(xt), 

where the last inequality comes from the log-concave tails assumption. For p' G xf) 

a{xt, X*) < 1 A j exp (p'72/2 - 7I/2) < 1 A exp {-a 2 p' + ai 7 i - 71 / 2 ) , 


'We use the open-source implementation provided at https : //github. com/ jcrudy/choldate 
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Figure 5: Acceptance probability of kernel induced Hamiltonian flow for a standard Gasussian in 
high dimensions for an isotropic Gaussian. Left: As a function of n = m (x-axis) and d (y-axis). 
Middle: Slices through left plot with error bars for a fixed n = m and as a function in d (left), and 
for a fixed d as a function of n = to (right). 


where xt is large enough that a 2 = aiLe — 72/2 > 0. Similarly forp' € {'^ijLe,xf) 

e^Lep' _ ^ > gS(x*-xO _ ^ ^ > Q. 

Because 71 and 72 can be chosen to be arbitrarily small, then for large enough xt we will have 
0 < Iq* < f - 1] exp {-a 2 p' + 01171 - 72 / 2 ) p{dp') + 

J 7i/Le 
pxf 

= - l]e-“^P'Ai(dp')(14) 

J 7i/Le 

where ci = 0171 — 7 I /2 > 0 for large enough xt, as 71 and 72 are of the same order. Now turning 
to p' G {—xf, 0 ), we can use an exact rearrangement of the same argument (noting that ci can be 
made arbitrarily small) to get 

/ [e-"=P - iMdp') < 0. (15) 

‘ J-yi/Le 

Combining ( [T4l l and ( fTS] ) and rearranging as in ll30l Theorem 3.2] shows that is strictly negative 
in the limit if S 2 = sLe is chosen small enough, as can also be made arbitrarily small. 

_ 5 

For it suffices to note that the Gaussian tails of fi{-) will dominate the exponential growth of 
^s(\\x (p ilb-lktib) meaning the integral can be made arbitrarily small by choosing large enough Xt, 

and the same argument holds for 77 ■ □ 

D Additional Experimental Details 

This section contains additional details for the experiments in Section]^ 

D.l Stability in High Dimensions 

We reproduce the experiment in Figure on an isotropic Gaussian in increasing dimension. As 
length-scales across all principal components are equal, this is a significantly less challenging target 
to estimate gradients for; though still useful as a benchmark representing very smooth targets. We 
use a standard Gaussian kernel and the same experimental protocol as for Figure]^ The estimator 
works slightly better than on the target considered in Figure]^ and performs well up to d « 100, see 
Figure]^ 

D.2 Banana target 

Following Go] El, let X ~ A/^(0, S) be a multivariate normal in d > 2 dimensions, with S = 
diag(u, which undergoes the transformation X ^ Y, where Y 2 = X 2 + b{Xf — v), and 

Yi = Xi for i 2. We will write Y ~ B{b, v). It is clear that EF = 0, and that 
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B{y, b,v) = N{yi\Q,v)N(y 2 ]b{yl - f;), 1) A/'(j/j; 0,1). 

i=3 

We choose d = 8,V = 100 and b = 0.03, which corresponds to the ‘strongly twisted’ 8-dimensional 
Banana in iioiiia. The target is challenging due to the nonlinear dependence of the first two 
dimensions and the highly position dependent scaling within these dimensions. 

D.3 Pseudo-Marginal MCMC for GP Classification 

Model Closely following lfT2l . we consider a joint distribution of GP-latent variables f, labels y 
(with covariate matrix X), and hyper-parameters 6, given by 

p{i,y,S) =p(%(f|6»)p(y|f), 

where f |6* ^ A((0, ICe), with ICe modeling the covariance between latent variables evaluated at the 
input covariates: {ICg)ij = «:(xi,x'|0) = exp(^-|^^^^ ^ ^nd 0^ = log^^. This 

covariance parametrisation allows to perform Automatic Relevance Determination. We here restrict 
our attention to the binary logistic classifier, i.e. the likelihood is given by 

p{yi\h) = -7-FT, 

where G { — 1,1}. Aiming for a fully Bayesian treatment, we wish to estimate the marginal 
posterior of the hyper-parameters 9, motivated in ||4l. The marginal likelihood p{y\0) is intractable 
for non-Gaussian likelihoods p(y|f), but can be replaced with an unbiased estimate 


p{y\9) ■■= 


1 


^imp 


f^imp 

2=1 


g(fW|6»)’ 


(16) 


where {f^llE” ~ (z(f|fi*) are riimp importance samples. In HI, the importance distribution y(f|0) 
is chosen as the Laplacian or as the Expectation Propagation (EP) approximation of p(f |y, 9) cx 
p{y\i)p{i\9), leading to state-of-the-art results. 


Experimental details We here use a Laplace approximation and riimp = 100. We consider clas¬ 
sification of window against non-window glass in the UCI Glass dataset, which induces a posterior 
that has a nonlinear shape ifT^ Eigure 3]. Since the ground truth for the hyperparameter posterior is 
not available, we initially run multiple hand-tuned standard Metropolis-Hastings chains for 500,000 
iterations (with a 100,000 bum-in), keep every 1000-th sample in each of the chains, and combine 
them. The resulting samples are used as a benchmark, to evaluate the performance all algorithms. 
We use the MMD between each sampler output and the benchmark sample is computed, using the 
polynomial kernel (1 + {9,9'))^. This corresponds to the estimation error of all mixed moments of 
order up to 3. 


Cross-validation Kernel parameters are tuned using a black box Bayesian optimisation pack- 
ag^and the median heuristic for KMC and KAMH repsectively. The Bayesian optimisation uses 
standard parameters and is stopped after 15 iterations, where each trial is done via a 5-fold cross- 
validation of the score matching objective (|^. We learn parameters after MCMC 500 iterations, and 
then re-leam after 2000. We tried re-learning parameters after more iterations, but this did not lead 
to significant changes. The costs for this are neglectable in the context of PM-MCMC as estimating 
the marginal likelihood takes significantly more time than generating the KMC proposal. 


D.4 ABC MCMC 

In this section, we give a brief background on Approximate Bayesian Computation, and how KMC 
can be used within the framework. We then give details of the competing approach in the final 
experiment in Section]^ including experimental details and an analytic counterexample. 

*We use the open-source package pybo, available under https : //github . com/mwhof fman/pybo 
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Likelihood-free Models Approximate Bayesian Computation is a method for inference in the 
scenario where conditional on some parameter of interest 9, we can easily simulate data x ~ f{'\S), 
but for which the likelihood function / is unavailable 0. We however have data y which assume 
to be from the model, and we have a prior 'Ko{9). A simple ABC algorithm is to sample 0i ^ 7ro(-) 
(or any other suitable distribution), simulate data Xi ^ f{-\9i), and ‘accept’ Xi as a sample from 
the approximate posterior TTc{d\y) if d{y, x) < e. This procedure can be formalised by defining the 
approximate likelihood as 

fe{y\0) (X j g^{y\x,e)f{x\e)Ax, (17) 

where ( 7 e(t/|x, 9) is an appropriate kernel that gives more importance to points for which d{y, x) is 
smaller. In the simple case above gg{y\x, 9) = l{d(y,x)<e}- The ABC posterior is then found using 
'Ke{9\y) cx f^{y\9)TTQ{9). Often g^ is based on some low-dimensional summary statistics, which can 
have both advantages and disadvantages. 


Likelihood-free MCMC There are many different way to do ABC, and clearly not all involve 
Markov chain Monte Carlo. If the posterior however is not similar to the prior, and if 9 is more 
than three or four dimensional, MCMC is a sensible option. Since the likelihood (|T^ is intractable, 
typically algorithms are considered for which an approximation to either the likelihood, or the ABC 
posterior are used either in constructing proposals, defining Metropolis-Hastings acceptance rates, 
or both. We focus here on samplers which target ■n^{9\y) directly, c.f. Q. 


Pseudo-Marginal Metropolis-Hastings Similar to the approach taken in Section |D.3| we here 
accept proposals 9' ~ Q(9, •) where Q is some proposal mechanism (i.e. KMC), via replacing the 
likelihood with an unbiased estimate. We accept according to the ratio 


a{9,9') 


U0'\y)Qi.()\9') 


(18) 


where 7re(6*|j/) = -Ko{9)g^{y\9), and 

9e{y\d) = —^ge(.y\xi,9), ^ f{-\9i) 

niik 

I 

is a simple Monte Carlo estimator for the intractable likelihood Since it is easy to simulate from 
/ then g^{y\9) is typically easy to compute. As with other general Pseudo-Marginal schemes, and as 
mentioned below the KMC acceptance ([^, it is crucial that if 9' is accepted, the same estimate for 
^{9'\y) is used on the denominator of the Hastings ratio in future iterations until the next proposal 
is accepted for the scheme, c.f. ©Table 1]. 


We can directly adapt KMC to the ABC case via plugging in the estimated likelihood g^ in the KMC 
acceptance ratio 


Synthetic Likelihood MetropoUs-Hastings Following 1261 . one idea to approximate the in¬ 
tractable likelihood is to draw n\± samples Xi ~ f{-\9i), and fit a Gaussian approximation to /, 
producing estimates ji and S for the mean and covariance using If the error functon g^ is 

also chosen to be a Gaussian (with mean y and variance e), then the marginal likelihood f^{y\9) can 
be approximated as 

y\9^M(g,t + ^l) 

The likelihood is essentially approximated by a Gaussian fa, producing a synthetic posterior 
which is then used in the accept-reject step. Clearly some approximation error is introduced by the 
Gaussian likelihood approximation step, but as shown in l26l . it can be a reasonable choice for some 
models. 


Hamiltonian ABC Introduced in |[8|, the synthetic likelihood formulation is used to construct a 
proposal, with the accept-reject step removed altogether. Hamiltonian dynamics use the gradient 
V log7r(0) to suggest candidate values for the next state of a Markov chain which are far from the 
current point, thus increasing the chances that the chain mixes quickly. Here the gradient of the 
log-likelihood is unavailable, so is approximated with that of a Gaussian (since the map 9 —> {y,, S) 


18 



is not always clear this is done numerically, using a stochastic finite differences estimate of the 
gradient, SPAS E Sec. 4.3, 4.4]), giving 

^lik 

Vlog7r(6<) « ^ Vlog/G( 2 /i|/i,S) + V log7ro(6»). 

i=l 

Since there is no accept-reject step, the synthetic posterior is also the target of this scheme (although 
there is also further bias introduced by discretisation error), but the introduction of gradient-based 
dynamics is hoped to improve mixing and hence efficiency of inferences compared to random-walk 
type schemes. 

A Counter-example We give a very simple toy model to highlight the bias introduced by the 
Hamiltonian ABC sampler. Consider posterior inference for the mean parameter in a log-Normal 
model. Specifically, the true model is 

^ -- M{ho,to), 

- logA/'(^,T), 

where the precision r and hyper-parameters /tq, tq are known. The model is in fact conjugate, giving 
a Gaussian posterior 

I .r froHo + Tj2i'^ogx, \ 

fi\y ^— - ,To + nT . 

\ To + nr ) 

If we introduce a Gaussian approximation to the likelihood, then the mean and precision of this 
approximation fa are (empirical estimates for) 

Mg = tg = 1/Var[ri] = ^ , 

which depend on the current value for y in the chain. The resulting synthetic posterior is no longer 
tractable, but since it is one dimensional we can approximate it numerically. Using /tq = 0, tq = 
1 /100, e = 0.1 and t = 1 then the true and approximate posteriors for 100 data points generated 
using the truth ^ = 2 are shown in Figure This is a proof of concept that a likelihood with a 
positive skew being approximated by a Gaussian introduces an upwards bias to the posterior. 

Experimental details The simulation study in Sectionj^uses a slightly more complex and multi¬ 
dimensional simulation example: a 10-dimensional multivariate skew-Normal distribution, given 
by 

p{y\e) = 2A/'(6»,/)$((a,j/)) 

with 0 = a = 1 • 10. In each iteration of KMC, the likelihood is estimated via simulating nuk = 10 
samples from the above likelihood. We use the mean of all samples as summary statistic, and a 
Gaussian similarity kernel g^{y\x^ 9) with a fixed e = 0.55. Both KMC and HABC use a standard 
Gaussian momentum, a uniformly random stepsize in [0.01,0.1] and L = 50 leapfrog steps. HABC 
is used with the suggested ‘sticky random numbers’ IS] Section 4.4], i.e. we use the same seed for 
all simulations along a single proposal trajectory. Both algorithms are run for 200 -I- 5000 MCMC 
iterations. KMC then attempts to re-learn smoothness parameters, and stops adaptation. Burn-in 
samples are discarded when quantifying performance of all algorithms. 

Friction, mixing, and number of simulations HABC is used in its ‘stochastic gradient’ 171 and 
has a ‘friction’ parameter that we estimate using a running average of the global covariance of all 
SPAS gradient evaluations, HI Equation 21]. Note that we ran HABC with both the friction term 
included and removed, where we found that adding friction has severely negative impact on mixing, 
where not adding friction results in a wider posterior (with the same bias). Figure(middle, right) 
show the results without friction. Figure shows the same plots with friction. We refer to our 
implementation for further details. 

Due to the gradient estimation in every of the L = 50 leap-frog steps, every MCMC proposal for 
HABC requires 2L — 100 simulations to be generated. In contrast, KMC only requires a single 
simulation, for evaluating the accept/reject probability ([^. We leave studying the exact trade-offs of 
KMC’s learning phase and its ability to mix well as compared to HABC to future work. 
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Figure 6: Counter example showing posterior and its synthetic approximation for a simple toy 
model. 
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Figure 7; Left: Counterexample showing posterior and its synthetic approximation for a simple toy 
model Middle/right: The same results as in Figure]^ i.e. autocorrelation and marginal posterior for 
01, but here we also show performance of HABC with added friction, which has a severely negative 
impact on mixing. 


20 







































